A New Family of Near-metrics for Universal Similarity

نویسندگان

  • Chu Wang
  • Iraj Saniee
  • William Sean Kennedy
  • Christopher A. White
چکیده

We propose a family of near-metrics based on local graph diffusion to capture similarity for a wide class of data sets. These quasi-metametrics, as their name suggests, dispense with one or two standard axioms of metric spaces, specifically distinguishability and symmetry, so that similarity between data points of arbitrary type and form could be measured broadly and effectively. The proposed near-metric family includes the forward k-step diffusion and its reverse, typically on the graph consisting of data objects and their features. By construction, this family of near-metrics is particularly appropriate for categorical data, continuous data, and vector representations of images and text extracted via deep learning approaches. We conduct extensive experiments to evaluate the performance of this family of similarity measures and compare and contrast with traditional measures of similarity used for each specific application and with the ground truth when available. We show that for structured data including categorical and continuous data, the near-metrics corresponding to normalized forward k-step diffusion (k small) work as one of the best performing similarity measures; for vector representations of text and images including those extracted from deep learning, the near-metrics derived from normalized and reverse k-step graph diffusion (k very small) exhibit outstanding ability to distinguish data points from different classes. ∗Nokia Bell Labs, 600 Mountain Avenue, Murray Hill, NJ 07974 (chu.wang, iraj.saniee, william.kennedy, [email protected]). 1 ar X iv :1 70 7. 06 90 3v 3 [ st at .M L ] 1 7 O ct 2 01 7

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Providing a Link Prediction Model based on Structural and Homophily Similarity in Social Networks

In recent years, with the growing number of online social networks, these networks have become one of the best markets for advertising and commerce, so studying these networks is very important. Most online social networks are growing and changing with new communications (new edges). Forecasting new edges in online social networks can give us a better understanding of the growth of these networ...

متن کامل

Simulated annealing and artificial immune system algorithms for cell formation with part family clustering

Cell formation problem (CFP) is one of the main problems in cellular manufacturing systems. Minimizing exceptional elements and voids is one of the common objectives in the CFP. The purpose of the present study is to propose a new model for cellular manufacturing systems to group parts and machines in dedicated cells using a part-machine incidence matrix to minimize the voids. After identifying...

متن کامل

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

متن کامل

An Effective Approach for Robust Metric Learning in the Presence of Label Noise

Many algorithms in machine learning, pattern recognition, and data mining are based on a similarity/distance measure. For example, the kNN classifier and clustering algorithms such as k-means require a similarity/distance function. Also, in Content-Based Information Retrieval (CBIR) systems, we need to rank the retrieved objects based on the similarity to the query. As generic measures such as ...

متن کامل

Using the Taxonomy and the Metrics: What to Study When and Why; Comment on “Metrics and Evaluation Tools for Patient Engagement in Healthcare Organization- and System-Level Decision-Making: A Systematic Review”

Dukhanin and colleagues’ taxonomy of metrics for patient engagement at the organizational and system levels has great potential for supporting more careful and useful evaluations of this ever-growing phenomenon. This commentary highlights the central importance to the taxonomy of metrics assessing the extent of meaningful participation in decision-making by patients, consumers and community mem...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1707.06903  شماره 

صفحات  -

تاریخ انتشار 2017